Linguistically Fuelled Text Similarity

Authors

  • Björn Andrist
  • Martin Duneld
Abstract

This paper describes TEXTSIM, a system for determining the similarity between texts. We also report a comparison between two configurations of TEXTSIM: one with and one without deeper linguistic analysis. To evaluate and compare the two models, we used two sets of examples: a set of automatically generated examples and a set of examples acquired from two assessors. Depending on the type of documents, the model using linguistic analysis performed as well as or better than the model without it.


Similar resources

How Noisy Social Media Text, How Diffrnt Social Media Sources?

While various claims have been made about social media text being noisy, there has never been a systematic study to investigate just how linguistically noisy or otherwise it is over a range of social media sources. We explore this question empirically over popular social media text types, in the form of YouTube comments, Twitter posts, web user forum posts, blog posts and Wikipedia, whi...


Baldwin, Timothy, Paul Cook, Marco Lui, Andrew MacKinlay and Li Wang (to appear) How Noisy Social Media Text, How Diffrnt Social Media Sources?, In Proceedings of the 6th International Joint Conference on Natural Language Processing (IJCNLP 2013), Nagoya, Japan

While various claims have been made about social media text being noisy, there has never been a systematic study to investigate just how linguistically noisy or otherwise it is over a range of social media sources. We explore this question empirically over popular social media text types, in the form of YouTube comments, Twitter posts, web user forum posts, blog posts and Wikipedia,...


Effects of Creativity and Cluster Tightness on Short Text Clustering Performance

Properties of corpora, such as the diversity of vocabulary and how tightly related texts cluster together, impact the best way to cluster short texts. We examine several such properties in a variety of corpora and track their effects on various combinations of similarity metrics and clustering algorithms. We show that semantic similarity metrics outperform traditional n-gram and dependency simi...


Linguistically Optimized Text Entry on a Mobile Phone

We present an analysis of linguistically optimized text entry on mobile phones. This analysis compares the behavior of a linguistically optimized system with word-based disambiguation methods. Through theoretical analysis, it is shown that in real-world situations in which typing errors are common and dictionaries are incomplete, the speed of text entry using a word-guessing method degrades to...


Bridging the Gap between Domain-Oriented and Linguistically-Oriented Semantics

This paper compares domain-oriented and linguistically-oriented semantics, based on the GENIA event corpus and FrameNet. While the domain-oriented semantic structures are direct targets of Text Mining (TM), their extraction from text is not straightforward due to the diversity of linguistic expressions. The extraction of linguistically-oriented semantics is more straightforward, and has been stud...



Publication date: 2007